Generalized substring selectivity estimation
Authors
Abstract
In a variety of settings from relational databases to LDAP to Web applications, there is an increasing need to quickly and accurately estimate the count of tuples (LDAP entries, Web documents, etc.) matching Boolean substring queries. In providing such selectivity estimates, the correlation between different occurrences of substrings is crucial. Selectivity estimation for generalized Boolean queries has not been studied previously; our own prior work, which is discussed and extended herein, applies to the case of one-dimensional Boolean queries [CKKM00]. Existing methods for the case of multidimensional conjunctive queries approximate selectivities by explicitly storing cross-counts of frequently co-occurring combinations of substrings; estimates are obtained by parsing the query into multidimensional substrings corresponding to stored cross-counts and applying probabilistic formulae. The major problem with these methods is that the number of cross-counts they store increases exponentially with the number of dimensions (a "space dimensionality explosion") due to the need to capture the correlation amongst the dimensions. Hence, given a limited amount of space, none of the existing methods can reliably give accurate estimates. Moreover, these methods do not generalize gracefully to Boolean queries. We present a novel approach to selectivity estimation for generalized Boolean substring queries with a focus on the two cases of (1) conjunctive multidimensional and (2) Boolean queries. Our approach does not explicitly store cross-counts, but rather generates them on-the-fly. We employ a Monte Carlo technique called set hashing to succinctly represent the set of tuples containing a given substring as a signature vector of hash values; any combination of set hash signatures gives a cross-count when intersected. Thus, using only linear storage, a large number of cross-counts can be generated, including those for complex co-occurrences of substrings.
The cross-counts generated by our methods are not exact, but they are adequate for selectivity estimation. We present results from an extensive experimental evaluation of our approach on real data sets. For the case of multidimensional conjunctive queries, our approach achieves better accuracy by an order of magnitude, and scales much more gracefully to higher dimensions, than existing methods. Surprisingly, even though our approach involves generating cross-counts on-the-fly, estimation is very fast, taking 200 ms on a data set of size 6 MB. For the case of Boolean queries, our experiments also demonstrate the superiority of this approach over a straightforward independence-based approach wherein correlations are not captured.
Authors: Z. Chen, F. Korn, N. Koudas, S. Muthukrishnan. PII: S0022-0000(02)00031-4. © 2003 Published by Elsevier Science (USA).
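The idea of generating cross-counts from set hash signatures can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of salted SHA-256 to simulate independent hash functions, the signature length `k = 256`, and the way the union size is recovered from the largest set are all assumptions made for illustration. Each substring is assumed to come with the set of tuple ids that contain it and with the exact size of that set. An independence-based estimate is included for comparison.

```python
import hashlib


def _hash(seed: int, tuple_id: int) -> int:
    """Hypothetical hash family: SHA-256 salted with `seed`, truncated to 64 bits."""
    digest = hashlib.sha256(f"{seed}:{tuple_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def signature(tuple_ids, k=256):
    """Set hash signature: for each of k hash functions, the minimum hash over the set."""
    ids = list(tuple_ids)
    if not ids:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, t) for t in ids) for seed in range(k)]


def resemblance(signatures):
    """Fraction of components on which all signatures agree.

    This estimates |intersection| / |union| of the underlying sets.
    """
    columns = list(zip(*signatures))
    agree = sum(1 for column in columns if len(set(column)) == 1)
    return agree / len(columns)


def estimate_cross_count(signatures, sizes):
    """Estimate the size of the intersection of the sets behind `signatures`.

    `sizes` holds the exact size of each set (assumed to be stored alongside
    its signature). The component-wise minimum of the signatures is a valid
    signature of the union, so |union| is approximated as
    |largest set| / resemblance(largest set, union).
    """
    union_sig = [min(column) for column in zip(*signatures)]
    largest = max(range(len(sizes)), key=sizes.__getitem__)
    rho_largest = resemblance([signatures[largest], union_sig])
    # Fall back to the trivial upper bound if no component happens to agree.
    union_size = sizes[largest] / rho_largest if rho_largest > 0 else float(sum(sizes))
    return resemblance(signatures) * union_size


def independence_estimate(sizes, total):
    """Baseline that ignores correlation: total * product of per-set selectivities."""
    estimate = float(total)
    for size in sizes:
        estimate *= size / total
    return estimate


if __name__ == "__main__":
    # Toy data: tuple ids containing two hypothetical substrings.
    ids_a, ids_b = range(0, 500), range(0, 1000)
    sig_a, sig_b = signature(ids_a), signature(ids_b)
    print("set hashing :", estimate_cross_count([sig_a, sig_b], [500, 1000]))
    print("independence:", independence_estimate([500, 1000], total=1500))
```

In this toy data every tuple that contains the first substring also contains the second, so the true cross-count is 500. The set hashing estimate is a random approximation of that value, while the independence baseline cannot see the correlation. Storage is one signature of length `k` per substring, which is what keeps the space linear in the number of substrings.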
Similar resources
Processing Queries on Road Networks in Spatial Data Base Perspective for Selectivity Estimation
This work focuses on building a framework capable of analyzing spatial approximate substring queries, mainly to solve the selectivity estimation problem for range queries on road networks represented in spatial databases. Selectivity estimation means estimating the size of the result, i.e., estimating the number of points present in a graph whi...
Substring Count Estimation in Extremely Long Strings
To estimate the number of substring matches against string data, count suffix trees (CS-tree) have been used as a kind of alphanumeric histograms. Although the trees are useful for substring count estimation in short data strings (e.g. name or title), they reveal several drawbacks when the target is changed to extremely long strings. First, it becomes too hard or at least slow to build CS-trees...
CXHist: An On-line Classification-Based Histogram for XML String Selectivity Estimation
Query optimization in IBM’s System RX, the first truly relational-XML hybrid data management system, requires accurate selectivity estimation of path-value pairs, i.e., the number of nodes in the XML tree reachable by a given path with the given text value. Previous techniques have been inadequate, because they have focused mainly on the tag-labeled paths (tree structure) of the XML data. For m...
Multi-Dimensional Substring Selectivity Estimation
With the explosion of the Internet, LDAP directories and XML, there is an ever greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use multi-dimensional count-suffix trees as t...
Generalized Substring Compression
In substring compression one is given a text to preprocess so that, upon request, a compressed substring is returned. Generalized substring compression is the same with the following twist. The queries contain an additional context substring (or a collection of context substrings) and the answers are the substring in compressed format, where the context substring is used to make the compression...
Journal title:
- J. Comput. Syst. Sci.
Volume: 66
Publication date: 2003